The two most important pieces of SeamlessM4T's translation pipeline are the Translator class and its predict method, so this article studies the structure of both.
When using the S2ST feature, the first step is to initialize a translator by constructing a Translator instance. Its code is as follows:
class Translator(nn.Module):
    """
    Initialization requires choosing the model, the vocoder, the device (GPU or CPU),
    and the data type (float16 or float32). The official defaults are:
        model_name_or_card: "seamlessM4T_large"
        vocoder_name_or_card: "vocoder_36langs"
        device: torch.device("cuda:0")
        dtype: torch.float16
    """

    def __init__(
        self,
        model_name_or_card: Union[str, AssetCard],
        vocoder_name_or_card: Union[str, AssetCard],
        device: Device,
        dtype: DataType = torch.float16,
    ):
        super().__init__()
        # Load the model.
        if device == torch.device("cpu"):
            dtype = torch.float32
        ## The chosen model ("seamlessM4T_large" or "seamlessM4T_medium").
        ## Its type is UnitYModel: the S2ST architecture developed by Meta AI,
        ## which first generates text and then predicts discrete acoustic units.
        self.model: UnitYModel = self.load_model_for_inference(
            load_unity_model, model_name_or_card, device, dtype
        )
        ## Load the text and acoustic-unit tokenizers: they split a long stream of
        ## text or acoustic units into segments for the later processing steps.
        self.text_tokenizer = load_unity_text_tokenizer(model_name_or_card)
        self.unit_tokenizer = load_unity_unit_tokenizer(model_name_or_card)
        self.device = device
        ## Load the audio decoder.
        self.decode_audio = AudioDecoder(dtype=torch.float32, device=device)
        ## The filterbank preprocesses the speech signal to match the human ear's
        ## non-linear response to the spectrum, which improves recognition accuracy.
        self.convert_to_fbank = WaveformToFbankConverter(
            num_mel_bins=80,
            waveform_scale=2**15,
            channel_last=True,
            standardize=True,
            device=device,
            dtype=dtype,
        )
        ## Collate the tokenized indices into padded batches.
        self.collate = Collater(
            pad_idx=self.text_tokenizer.vocab_info.pad_idx, pad_to_multiple=2
        )
        # Load the vocoder.
        ## The chosen vocoder ("vocoder_36langs").
        ## Its type is Vocoder: a speech analysis/synthesis system, used here mainly
        ## to synthesize human speech from the predicted acoustic units.
        self.vocoder: Vocoder = self.load_model_for_inference(
            load_vocoder_model, vocoder_name_or_card, device, torch.float32
        )
Once the translator is initialized, translation is performed with the predict function. Its arguments are the path to the audio file, the task to perform (one of "s2tt", "s2st", "t2tt", "t2st", or "asr"), and the target language. Let's look at the structure of predict:
def predict(
    self,
    input: Union[str, Tensor],
    task_str: str,
    tgt_lang: str,
    src_lang: Optional[str] = None,
    spkr: Optional[int] = -1,
    ngram_filtering: bool = False,
    sample_rate: int = 16000,
    text_max_len_a: int = 1,
    text_max_len_b: int = 200,
    unit_max_len_a: Optional[int] = None,
    unit_max_len_b: Optional[int] = None,
) -> Tuple[StringLike, Optional[Tensor], Optional[int]]:
    """
    The main method used to perform inference on all tasks.
    :param input:
        Either text or path to audio or audio Tensor.
    :param task_str:
        String representing the task.
        Valid choices are "S2ST", "S2TT", "T2ST", "T2TT", "ASR"
    :param tgt_lang:
        Target language to decode into.
    :param src_lang:
        Source language of input, only required for T2ST, T2TT tasks.
    :param spkr:
        Speaker id for vocoder.
    :returns:
        - Translated text.
        - Generated output audio waveform corresponding to the translated text.
        - Sample rate of output audio waveform.
    """
    # task_str must be one of the five supported task names; case does not matter
    # because it is upper-cased here. Any other name raises "Unsupported task".
    try:
        task = Task[task_str.upper()]
    except KeyError:
        raise ValueError(f"Unsupported task: {task_str}")
    # Derive the input and output modalities (speech or text) from the task name.
    input_modality, output_modality = self.get_modalities_from_task(task)
    # If the input modality is speech, decode the audio signal first.
    if input_modality == Modality.SPEECH:
        audio = input
        if isinstance(audio, str):
            with Path(audio).open("rb") as fb:
                block = MemoryBlock(fb.read())
            decoded_audio = self.decode_audio(block)
        else:
            decoded_audio = {
                "waveform": audio,
                "sample_rate": sample_rate,
                "format": -1,
            }
        src = self.collate(self.convert_to_fbank(decoded_audio))["fbank"]
    # If the input modality is text, first create a token encoder for the
    # "translation" task with the source language.
    else:
        if src_lang is None:
            raise ValueError("src_lang must be specified for T2ST, T2TT tasks.")
        text = input
        self.token_encoder = self.text_tokenizer.create_encoder(
            task="translation", lang=src_lang, mode="source", device=self.device
        )
        # Encode the input text with the token encoder.
        src = self.collate(self.token_encoder(text))
    # Call get_prediction and collect the prediction results.
    result = self.get_prediction(
        self.model,
        self.text_tokenizer,
        self.unit_tokenizer,
        src,
        input_modality,
        output_modality,
        tgt_lang=tgt_lang,
        ngram_filtering=ngram_filtering,
        text_max_len_a=text_max_len_a,
        text_max_len_b=text_max_len_b,
        unit_max_len_a=unit_max_len_a,
        unit_max_len_b=unit_max_len_b,
    )
    # Element 0 of result holds the text output; element 1 holds the acoustic units.
    text_out = result[0]
    unit_out = result[1]
    # If the output modality is text, return only the translated text.
    if output_modality == Modality.TEXT:
        return text_out.sentences[0], None, None
    # Otherwise, still return the translated text, and additionally the synthesized
    # audio waveform and its sample rate.
    else:
        units = unit_out.units[:, 1:][0].cpu().numpy().tolist()
        wav_out = self.vocoder(units, tgt_lang, spkr, dur_prediction=True)
        return text_out.sentences[0], wav_out, sample_rate
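With a translator in hand, typical calls to predict might look like the sketch below. The input path and the language codes ("cmn", "eng", "fra") are placeholders chosen for illustration; substitute the language codes supported by your model checkpoint.

# Speech-to-speech translation: input is a path to an audio file (16 kHz assumed,
# matching the sample_rate default); returns text, a waveform, and its sample rate.
translated_text, wav, sr = translator.predict("input.wav", "s2st", tgt_lang="cmn")

# Speech-to-text translation: only text is produced, so the audio slots are None.
translated_text, _, _ = translator.predict("input.wav", "s2tt", tgt_lang="eng")

# Text-to-text translation: src_lang is required for text input, as enforced above.
translated_text, _, _ = translator.predict(
    "Hello, world.", "t2tt", tgt_lang="fra", src_lang="eng"
)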
To recap the structure of SeamlessM4T's translation code: we first initialize a Translator, choosing the model "seamlessM4T_large", whose type is UnitYModel. The translator carries a text tokenizer and an acoustic-unit tokenizer to segment the input before prediction, along with an audio decoder and a synthesis vocoder. Once the translator is ready, prediction can begin: given the task, the input modality (speech or text), and the text and acoustic-unit tokenization parameters, the predict function produces the translated text, the output audio, and its sample rate.
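Since the input and output modalities follow from the task, the conceptual mapping can be summarized as below; this is an illustrative sketch, not the library's actual get_modalities_from_task implementation.

# Conceptual input/output modality per task (illustrative sketch only):
TASK_MODALITIES = {
    "S2ST": ("speech", "speech"),
    "S2TT": ("speech", "text"),
    "T2ST": ("text",   "speech"),
    "T2TT": ("text",   "text"),
    "ASR":  ("speech", "text"),   # transcription keeps the source language
}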
This flow is shared by every task; each task ("s2tt", "s2st", "t2tt", "t2st", "asr") then has its own data processing, conversion, and prediction steps. Next we will study the code structure of the S2TT (Speech-to-Text Translation) flow, whose main components are the UnitYModel model and the get_prediction function.